PyMC3 and the NBA

Welcome back to the fifth blog in the BA blog series. The theme of today's blog is, by far, one of my favourite topics - sports. I will demonstrate the power that pymc3, a Bayesian statistical modelling package for Python, has in the data science field.

To begin, we are going to import the usual suspects, along with my favourite way to play with NBA sports data, the nba_api package.

import pandas as pd
import nba_api
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm

from nba_api.stats.static import teams
from nba_api.stats.endpoints import leaguegamefinder
from nba_api.stats.endpoints import teamyearbyyearstats

pd.set_option('display.max_columns',100)

This next step calls up the three teams we are going to look at more closely today.

df will be the Dallas Mavericks, df1 will be the San Antonio Spurs, and df2 will be the Toronto Raptors.

team_dict = teams.get_teams()

mavericks = [team for team in team_dict if team['full_name']=='Dallas Mavericks'][0]
spurs = [team for team in team_dict if team['full_name']=='San Antonio Spurs'][0]
raptors = [team for team in team_dict if team['full_name']=='Toronto Raptors'][0]

mavericks_id = mavericks['id']
spurs_id = spurs['id']
raptors_id = raptors['id']

df = teamyearbyyearstats.TeamYearByYearStats(mavericks_id).get_data_frames()[0]
df1 = teamyearbyyearstats.TeamYearByYearStats(spurs_id).get_data_frames()[0]
df2 = teamyearbyyearstats.TeamYearByYearStats(raptors_id).get_data_frames()[0]

The next two steps are a little cleaning to make the data more workable.

df.columns = map(str.lower, df.columns)
df1.columns = map(str.lower, df1.columns)
df2.columns = map(str.lower, df2.columns)

df['year'] = df['year'].str.split('-').str[1]
df1['year'] = df1['year'].str.split('-').str[1]
df2['year'] = df2['year'].str.split('-').str[1]
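To make the year cleanup concrete, here is a small sketch in plain Python of what that split does to a single label. It assumes the season labels look like '1998-99'; the helper names season_suffix and season_end_year are made up for this illustration and are not part of pandas or nba_api.

```python
def season_suffix(label):
    """Return the part after the dash, e.g. '1998-99' -> '99'."""
    return label.split("-")[1]


def season_end_year(label):
    """Rebuild the four-digit year in which the season ended (assumes 'YYYY-YY')."""
    start_year = int(label.split("-")[0])
    return start_year + 1


# Hypothetical labels in the assumed 'YYYY-YY' format.
labels = ["1998-99", "1999-00", "2006-07"]

suffixes = [season_suffix(label) for label in labels]
end_years = [season_end_year(label) for label in labels]

print(suffixes)
print(end_years)
```

The two-digit suffix is all the bar charts need as a label. The four-digit version is only worth the extra step if you want to sort or do arithmetic on the years.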

Now, let's plot the three teams using matplotlib with the seaborn style.

Notice that we are plotting each year the franchise has been around on the X-axis and the number of wins on the Y-axis.

# On matplotlib 3.6 and later this style is named 'seaborn-v0_8'.
plt.style.use('seaborn')
fig, ax = plt.subplots(nrows=3, ncols=1, sharex=False, sharey=False,figsize=(20,15))

ax[0].bar(df['year'],df['wins'], color='blue', label = 'Dallas Mavericks')
ax[1].bar(df1['year'],df1['wins'], color='black', label = 'San Antonio Spurs')
ax[2].bar(df2['year'],df2['wins'], color='purple', label = 'Toronto Raptors')
ax[0].legend(loc='upper left')
ax[1].legend(loc='upper left')
ax[2].legend(loc='upper left')
plt.show()

Now the real magic begins - bring in pymc3!

The first pymc3 model we are going to run is on the Dallas Mavericks. The model has two lambdas, each given a Normal prior, as you can see in the code. Each lambda stands for the typical season win total in one stretch of the franchise's history, and the model tries to notice a "change" from one to the other.

The change is the other important part of this model, and it's called the tau. The tau returns a number that represents the point at which the model identifies a shift.
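If the lambdas and the tau feel abstract, a tiny brute-force version may help. This is only a sketch and not how pymc3 works internally: instead of sampling, it tries every candidate tau, fits each stretch by its own average, and keeps the tau with the smallest squared error. The win totals are invented, and the function names are hypothetical.

```python
# Hypothetical win totals: a weak stretch followed by a strong one.
wins = [22, 28, 25, 30, 24, 27, 48, 52, 55, 50, 53, 57]


def piecewise_mean(tau, idx, lambda_1, lambda_2):
    """Same rule as switch(tau > idx, lambda_1, lambda_2) for one season index."""
    return lambda_1 if tau > idx else lambda_2


def squared_error(tau, data):
    """Sum of squared errors when each stretch is fitted by its own mean."""
    before, after = data[:tau], data[tau:]
    lambda_1 = sum(before) / len(before)
    lambda_2 = sum(after) / len(after)
    return sum(
        (y - piecewise_mean(tau, i, lambda_1, lambda_2)) ** 2
        for i, y in enumerate(data)
    )


def best_tau(data, lower, upper):
    """Try every tau from lower to upper (inclusive) and keep the best fit."""
    return min(range(lower, upper + 1), key=lambda tau: squared_error(tau, data))


tau_hat = best_tau(wins, lower=2, upper=10)
print(tau_hat)
```

The search returns one best guess. The appeal of the Bayesian model is that it also tells you how sure it is about the tau and about each lambda.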

Running our first model:

with pm.Model() as model:
    lambda_1 = pm.Normal('lambda_1', 10, 20)
    lambda_2 = pm.Normal('lambda_2', 10, 20)
    tau = pm.DiscreteUniform("tau", lower=5, upper=35)

    idx = np.arange(len(df)) # Index
    lambda_ = pm.math.switch(tau > idx, lambda_1, lambda_2)

    observation = pm.Normal("obs", lambda_, observed=df['wins'])
    trace = pm.sample(10000, tune=1000, chains=2)

Multiprocess sampling (2 chains in 2 jobs)
CompoundStep
>NUTS: [lambda_2, lambda_1]
>Metropolis: [tau]
Sampling 2 chains, 0 divergences: 100%|██████████| 22000/22000 [00:15<00:00, 1400.18draws/s]
The number of effective samples is smaller than 10% for some parameters.

## Dirk
pm.summary(trace)

	mean	sd	hpd_3%	hpd_97%	mcse_mean	mcse_sd	ess_mean	ess_sd	ess_bulk	ess_tail	r_hat
lambda_1	31.872	0.282	31.367	32.425	0.005	0.004	2730.0	2716.0	2894.0	5336.0	1.0
lambda_2	47.746	0.270	47.268	48.282	0.005	0.004	2776.0	2768.0	2956.0	5347.0	1.0
tau	19.215	0.411	19.000	20.000	0.012	0.009	1098.0	1098.0	1098.0	1098.0	1.0

Now, those of you who follow the NBA will have heard of someone called Dirk Nowitzki. He played his entire career for the Mavericks and was a perennial All-Star who brought a championship to that team.

From the above summary, you can see that our model noticed a change at the 19th bar of the Dallas graph, which corresponds to the 98-99 season.
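A quick way to check which season a tau value points to is to split the season labels with the same comparison the model uses. The labels and the tau below are made up for illustration, and regimes is a hypothetical helper. Because Python counts from 0, the tau ends up equal to the number of seasons in the first stretch.

```python
# Hypothetical season labels, in the same chronological order as the bars.
seasons = ["2000-01", "2001-02", "2002-03", "2003-04", "2004-05", "2005-06"]
tau = 3  # hypothetical changepoint value


def regimes(labels, tau):
    """Split the labels the way switch(tau > idx, ...) splits the seasons."""
    first = [label for idx, label in enumerate(labels) if tau > idx]
    second = [label for idx, label in enumerate(labels) if not tau > idx]
    return first, second


first, second = regimes(seasons, tau)
print("last season before the change:", first[-1])
print("first season after the change:", second[0])
```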

Drumroll, please: Dirk Nowitzki's rookie season with the Dallas Mavericks was 98-99. Add the shocked monkey face from the previous blog. It is amazing that pymc3 noticed that something significant happened at this time.

Still a skeptic of pymc3? Let's do another.

with pm.Model() as model:
    lambda_1 = pm.Normal('lambda_1', 10, 20)
    lambda_2 = pm.Normal('lambda_2', 10, 20)
    tau = pm.DiscreteUniform("tau", lower=5, upper=50)

    idx = np.arange(len(df1)) # Index
    lambda_ = pm.math.switch(tau > idx, lambda_1, lambda_2)

    observation = pm.Normal("obs", lambda_, observed=df1['wins'])
    trace = pm.sample(10000, tune=1000, chains=2)

Multiprocess sampling (2 chains in 2 jobs)
CompoundStep
>NUTS: [lambda_2, lambda_1]
>Metropolis: [tau]
Sampling 2 chains, 0 divergences: 100%|██████████| 22000/22000 [00:16<00:00, 1373.07draws/s]

## Timmy D
pm.summary(trace)

	mean	sd	hpd_3%	hpd_97%	mcse_mean	mcse_sd	ess_mean	ess_sd	ess_bulk	ess_tail	r_hat
lambda_1	44.651	0.212	44.255	45.042	0.002	0.001	18282.0	18259.0	18285.0	12643.0	1.0
lambda_2	55.134	0.219	54.725	55.547	0.002	0.001	18181.0	18181.0	18175.0	13052.0	1.0
tau	23.000	0.000	23.000	23.000	0.000	0.000	20000.0	20000.0	20000.0	20000.0	NaN

You will notice from the summary that the tau is at 23 now, which corresponds to the 98-99 season. That was the second year for arguably the best power forward ever to play in the NBA - Tim Duncan. He also went on to win multiple championships with the San Antonio Spurs.

Notice that the only difference from the first model is the increased upper limit for tau, because this team has been around longer than the Dallas Mavericks.
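Since the limits for tau depend on how many seasons a team has played, one option is to work them out from the length of the data rather than typing them in. The helper below is hypothetical, and the rule of keeping five seasons clear at each end is my own assumption for this sketch, not something pymc3 requires.

```python
def tau_bounds(n_seasons, margin=5):
    """Return (lower, upper) for tau, keeping `margin` seasons on each side."""
    if n_seasons <= 2 * margin:
        raise ValueError("not enough seasons for this margin")
    return margin, n_seasons - margin


# Hypothetical franchise lengths, in seasons.
for n_seasons in (25, 40, 44):
    lower, upper = tau_bounds(n_seasons)
    print(n_seasons, lower, upper)
```

With real data you would pass len(df1) as n_seasons and feed the result into the lower and upper arguments of the tau.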

Do you want to see one more? Of course you do.

with pm.Model() as model:
    lambda_1 = pm.Normal('lambda_1', 10, 20)
    lambda_2 = pm.Normal('lambda_2', 10, 20)
    tau = pm.DiscreteUniform("tau", lower=5, upper=15)

    idx = np.arange(len(df2)) # Index
    lambda_ = pm.math.switch(tau > idx, lambda_1, lambda_2)

    observation = pm.Normal("obs", lambda_, observed=df2['wins'])
    trace = pm.sample(10000, tune=1000, chains=2)

Multiprocess sampling (2 chains in 2 jobs)
CompoundStep
>NUTS: [lambda_2, lambda_1]
>Metropolis: [tau]
Sampling 2 chains, 0 divergences: 100%|██████████| 22000/22000 [00:15<00:00, 1383.41draws/s]

pm.summary(trace)

	mean	sd	hpd_3%	hpd_97%	mcse_mean	mcse_sd	ess_mean	ess_sd	ess_bulk	ess_tail	r_hat
lambda_1	30.995	0.305	30.429	31.569	0.002	0.002	18299.0	18294.0	18313.0	14028.0	1.0
lambda_2	42.921	0.268	42.425	43.425	0.002	0.001	18507.0	18507.0	18513.0	14695.0	1.0
tau	11.000	0.000	11.000	11.000	0.000	0.000	20000.0	20000.0	20000.0	20000.0	NaN

This last one has a special place in my heart as a Canadian. The tau is 11, which corresponds to the 2006-07 season for the Raptors. That was the first time the Raptors won a division title, and maybe the season that built the culture that helped them win the NBA championship in the 2018-19 season.

And with that, I hope you see the power of pymc3. It needs minimal inputs to make it work and is immensely powerful.

Go forth and use it for good!